Project Stage 2¶

Data Analysis in Predict House Pricing¶

Introduction¶

The aim of this notebook is to build some models that can predict NSW housing prices based on a set of scrapped features made available in the NSW Housing Dataset. The current dataset has recently been updated, so it's interesting to explore the significance of new features, and their impact on model accuracy.

We obtained two dataset from Kaggle (www.kaggle.com/datasets/karltse/sydney-suburbs-reviews) (www.kaggle.com/datasets/alexlau203/sydney-house-prices). As a continuation of the stage 1, both datasets have information that will help us to analyse and answer some questions. The purpose of the stage 2 is to answer the questions from the previous stage.

Questions¶

How to compare the performance of different machine learning models? What is the predicted prices and actual prices of the houses and elaborate the relation and trends? What does Sydney houses pricing dataset show about the prices of the houses over different period? What is overall trend of highest and lowest percentage of ethnics of language speakers in five different suburbs? What is the relationship between the population of a suburb and its review score? What can be the main factor of properties expensiveness? What is the relation between sales prices of home having low to higher numbers of beds and prices of home which are near to CBD? What is the relationship between the sales price of properties in Sydney and factors such as the type of region, number of beds, property size, distance from the CBD, median income of the suburb, and population of the suburb? How do the house prices in Sydney vary by suburb according to the data in the "sydney-house-prices" dataset?

Description of the Data¶

Description of Dataset 1¶

Dataset1=pd.read_csv('domain_properties.csv')

  1. price is the price of the property in AUD
  2. date_sold is the date the property was sold
  3. suburb is the suburb the property is situated in
  4. num_bath is the number of bathrooms in the property
  5. num_bed is the number of bedrooms in the property
  6. num_parking is the number of parking spaces on the property
  7. property_size is the size of the property in square metres
  8. type is the type of building
  9. suburb_population is the population of the suburb the property is situated in
  10. suburb_median_income is the median income of the suburb the property is situated in
  11. srburb_lat is the latitude of the suburb that the property is situated in
  12. suburb_lng is the longitude of the suburb that the property is situated in
  13. suburb_elevation is the elevation of the suburb that the property is situated in.
  14. cash_rate is the cash rate at the time the property was sold
  15. property_inflation_index is the residential property price inflation index of the quarter that the property was sold in
  16. km_from_cbd is the distance between the property and the centre of Sydney CBD.

Description of Dataset 2¶

Dataset2=pd.read_csv('Sydney Suburbs Reviews.csv')

  1. Name is suburb name
  2. Region is NSW regions
  3. Population (rounded)* is suburb population
  4. Postcode is Postcode in each suburbs
  5. Ethnic Breakdown 2016 is type of ethnic in NSW
  6. Median House Price (2020) is median properties price in 2020
  7. Median House Price (2021) is median properties price in 20201
  8. % Change is price between 2020-2021
  9. Median House Rent (per week) is renting house fee per week
  10. Median Apartment Price (2020) is selling apartment fee in 2020
  11. Median Apartment Rent (per week)is renting apartment fee per week
  12. Public Housing % is % of public housing
  13. Avg. Years Held is heldig avg years
  14. Time to CBD (Public Transport) [Town Hall St] is duration of transport to CBD
  15. Time to CBD (Driving) [Town Hall St] is duration of driving to CBD
  16. Nearest Train Station is nearest train station
  17. Highlights/Attractions is popular palace
  18. Ideal for is type of ideal
  19. Traffic is avg of traffic jam
  20. Public Transport is rating of Public Transport
  21. Affordability (Rental)is rating of Affordability (Rental)
  22. Affordability (Buying)is rating of Affordability (Buying)
  23. Nature is rating of nature
  24. Noise is rating of noise
  25. Things to See/Do is rating of Things to see
  26. Family-Friendliness is rating of Family-Friendliness
  27. Pet Friendliness is rating of Pet Friendliness
  28. Safety is rating of Safety
  29. Overall Rating is rating of Overall
  30. Review Link is rating of website link

Data Preparation¶

Loading Data

In the below. two data set has been uploaded. Sydney suburb reviews and second one is domain properties. while in second data sheet it has two years data 2020 and 2021 including suburb details about ethnics with different languages and the average change of prices between 2020 and 2021.

In [30]:
import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt
import re
import plotly.express as px
import seaborn as sns

Dataset1=pd.read_csv('domain_properties.csv',header=0)
Dataset2=pd.read_csv('Sydney Suburbs Reviews.csv',header=0)

The first data set shows sydeny different suburb housing costs, property sizes, when they are sold, number of baths, bedrooms, parking slots, suburb population and also suburb average income with property inflation index.

In [31]:
data_price = Dataset1.copy()
data_price
Out[31]:
price date_sold suburb num_bath num_bed num_parking property_size type suburb_population suburb_median_income suburb_sqkm suburb_lat suburb_lng suburb_elevation cash_rate property_inflation_index km_from_cbd
0 530000 13/1/16 Kincumber 4 4 2 1351 House 7093 29432 9.914 -33.47252 151.40208 24 2.0 150.9 47.05
1 525000 13/1/16 Halekulani 2 4 2 594 House 2538 24752 1.397 -33.21772 151.55237 23 2.0 150.9 78.54
2 480000 13/1/16 Chittaway Bay 2 4 2 468 House 2028 31668 1.116 -33.32678 151.44557 3 2.0 150.9 63.59
3 452000 13/1/16 Leumeah 1 3 1 344 House 9835 32292 4.055 -34.05375 150.83957 81 2.0 150.9 40.12
4 365500 13/1/16 North Avoca 0 0 0 1850 Vacant land 2200 45084 1.497 -33.45608 151.43598 18 2.0 150.9 49.98
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
11155 1900000 31/12/21 Kellyville 3 4 2 540 House 27971 46228 18.645 -33.69583 150.95622 78 0.1 220.1 30.08
11156 1300000 31/12/21 Seven Hills 3 7 2 1208 House 19326 33540 9.629 -33.77743 150.94272 38 0.1 220.1 26.58
11157 1025000 31/12/21 Sydney 2 2 1 129 Apartment / Unit / Flat 17252 35412 2.940 -33.86794 151.20998 65 0.1 220.1 0.31
11158 1087500 1/1/22 Prestons 2 4 2 384 House 15313 36244 9.215 -33.94155 150.87334 28 0.1 220.1 32.26
11159 1000000 1/1/22 Ourimbah 2 3 2 667 House 3951 37180 87.154 -33.31517 151.32611 191 0.1 220.1 61.95

11160 rows × 17 columns

The second data set shows the data of different suburb with general regions indicating population with ethnic breakdown with two years (2020 and 2021) different prices. Furthermore the average distance to nearest station, most attractive places around. trafffic and public transport points etc...

In [32]:
data_ethnic=Dataset2.copy()
data_ethnic.head()
Out[32]:
Name Region Population (rounded)* Postcode Ethnic Breakdown 2016 Median House Price (2020) Median House Price (2021) % Change Median House Rent (per week) Median Apartment Price (2020) ... Affordability (Rental) Affordability (Buying) Nature Noise Things to See/Do Family-Friendliness Pet Friendliness Safety Overall Rating Review Link
0 Hornsby Upper North Shore 23,000 2077 Chinese 17.1%, English 16.8%, Australian 14.0%... $1,150,000.00 $1,400,000.00 21.74% $600.00 $645,000.00 ... 7.0 6.0 8.0 7.0 7.0 9.0 8.0 9.0 7.1 https://sydneysuburbreviews.com/hornsby
1 Oatley Southern Suburbs 10,500 2223 English 23.0%, Australian 21.1%, Chinese 9.8%,... $1,550,000.00 $1,800,000.00 16.13% $670.00 $780,000.00 ... 6.0 4.0 8.0 9.0 5.0 9.0 8.0 9.0 7.0 https://sydneysuburbreviews.com/oatley-suburb-...
2 Dulwich Hill Inner West 14,500 2203 English 19.4%, Australian 16.4%, Irish 9.5%, S... $1,500,000.00 $1,900,000.00 26.67% $725.00 $780,000.00 ... 6.0 4.0 8.0 7.0 6.0 8.0 9.0 8.0 6.9 https://sydneysuburbreviews.com/dulwich-hill
3 Jannali Sutherland Shire 6,250 2226 English 28.2%, Australian 26.3%, Irish 9.8%, S... $1,050,000.00 $1,300,000.00 23.81% $610.00 $620,000.00 ... 7.0 7.0 7.0 8.0 2.0 8.0 8.0 9.0 6.8 https://sydneysuburbreviews.com/jannali
4 Waverton Lower North Shore 3,250 2060 English 24.9%, Australian 15.5%, Irish 11.0%, ... $2,650,000.00 $3,400,000.00 28.30% $1,000.00 $1,200,000.00 ... 5.0 2.0 8.0 9.0 4.0 9.0 8.0 9.0 6.8 https://sydneysuburbreviews.com/waverton

5 rows × 30 columns

Explore Feature¶

In [33]:
#dataset 1
data_price.info()
print('Number of instances = %d' % (data_price.shape[0]))
print('Number of attributes = %d' % (data_price.shape[1]))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11160 entries, 0 to 11159
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   price                     11160 non-null  int64  
 1   date_sold                 11160 non-null  object 
 2   suburb                    11160 non-null  object 
 3   num_bath                  11160 non-null  int64  
 4   num_bed                   11160 non-null  int64  
 5   num_parking               11160 non-null  int64  
 6   property_size             11160 non-null  int64  
 7   type                      11160 non-null  object 
 8   suburb_population         11160 non-null  int64  
 9   suburb_median_income      11160 non-null  int64  
 10  suburb_sqkm               11160 non-null  float64
 11  suburb_lat                11160 non-null  float64
 12  suburb_lng                11160 non-null  float64
 13  suburb_elevation          11160 non-null  int64  
 14  cash_rate                 11160 non-null  float64
 15  property_inflation_index  11160 non-null  float64
 16  km_from_cbd               11160 non-null  float64
dtypes: float64(6), int64(8), object(3)
memory usage: 1.4+ MB
Number of instances = 11160
Number of attributes = 17
In [34]:
#dataset 2
data_ethnic.info()
print('Number of instances = %d' % (data_ethnic.shape[0]))
print('Number of attributes = %d' % (data_ethnic.shape[1]))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 421 entries, 0 to 420
Data columns (total 30 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   Name                                           421 non-null    object 
 1   Region                                         421 non-null    object 
 2   Population (rounded)*                          421 non-null    object 
 3   Postcode                                       421 non-null    int64  
 4   Ethnic Breakdown 2016                          421 non-null    object 
 5   Median House Price (2020)                      420 non-null    object 
 6   Median House Price (2021)                      409 non-null    object 
 7   % Change                                       421 non-null    object 
 8   Median House Rent (per week)                   402 non-null    object 
 9   Median Apartment Price (2020)                  299 non-null    object 
 10  Median Apartment Rent (per week)               350 non-null    object 
 11  Public Housing %                               419 non-null    object 
 12  Avg. Years Held                                370 non-null    float64
 13  Time to CBD (Public Transport) [Town Hall St]  417 non-null    object 
 14  Time to CBD (Driving) [Town Hall St]           419 non-null    object 
 15  Nearest Train Station                          350 non-null    object 
 16  Highlights/Attractions                         114 non-null    object 
 17  Ideal for                                      116 non-null    object 
 18  Traffic                                        419 non-null    float64
 19  Public Transport                               420 non-null    float64
 20  Affordability (Rental)                         420 non-null    float64
 21  Affordability (Buying)                         420 non-null    float64
 22  Nature                                         420 non-null    float64
 23  Noise                                          420 non-null    float64
 24  Things to See/Do                               420 non-null    float64
 25  Family-Friendliness                            420 non-null    float64
 26  Pet Friendliness                               420 non-null    float64
 27  Safety                                         420 non-null    float64
 28  Overall Rating                                 420 non-null    float64
 29  Review Link                                    114 non-null    object 
dtypes: float64(12), int64(1), object(17)
memory usage: 98.8+ KB
Number of instances = 421
Number of attributes = 30
In [35]:
#dataset 1
data_price.describe().T
Out[35]:
count mean std min 25% 50% 75% max
price 11160.0 1.675395e+06 1.290371e+06 225000.00000 1.002000e+06 1.388000e+06 2.020000e+06 6.000000e+07
num_bath 11160.0 2.073566e+00 1.184881e+00 0.00000 1.000000e+00 2.000000e+00 3.000000e+00 4.600000e+01
num_bed 11160.0 3.758961e+00 1.559743e+00 0.00000 3.000000e+00 4.000000e+00 4.000000e+00 4.700000e+01
num_parking 11160.0 2.017473e+00 1.454560e+00 0.00000 1.000000e+00 2.000000e+00 2.000000e+00 5.000000e+01
property_size 11160.0 7.230124e+02 1.048984e+03 7.00000 4.300000e+02 6.000000e+02 7.650000e+02 5.910000e+04
suburb_population 11160.0 9.311560e+03 7.541636e+03 22.00000 3.977000e+03 7.457000e+03 1.215825e+04 4.717600e+04
suburb_median_income 11160.0 4.016824e+04 1.108996e+04 14248.00000 3.244800e+04 3.910400e+04 4.555200e+04 9.750000e+04
suburb_sqkm 11160.0 5.054877e+00 5.824663e+00 0.08900 1.776000e+00 3.566000e+00 6.568000e+00 8.715400e+01
suburb_lat 11160.0 -3.378141e+01 2.024778e-01 -34.10624 -3.392148e+01 -3.380918e+01 -3.371551e+01 -3.316376e+01
suburb_lng 11160.0 1.510967e+02 2.134562e-01 150.55384 1.509510e+02 1.511095e+02 1.512278e+02 1.515733e+02
suburb_elevation 11160.0 5.560672e+01 5.280232e+01 0.00000 2.100000e+01 4.000000e+01 7.500000e+01 4.050000e+02
cash_rate 11160.0 6.313611e-01 6.586239e-01 0.10000 1.000000e-01 1.100000e-01 1.500000e+00 2.000000e+00
property_inflation_index 11160.0 1.884897e+02 2.444155e+01 150.90000 1.676000e+02 1.766000e+02 2.201000e+02 2.201000e+02
km_from_cbd 11160.0 2.738183e+01 1.847011e+01 0.31000 1.296000e+01 2.231000e+01 4.099000e+01 8.479000e+01
In [36]:
#dataset 2
data_ethnic.describe().T
Out[36]:
count mean std min 25% 50% 75% max
Postcode 421.0 2232.973872 235.654897 2000.0 2076.000 2153.0 2225.000 2770.0
Avg. Years Held 370.0 12.915676 2.684144 1.7 11.425 13.0 14.475 22.7
Traffic 419.0 1.105012 1.892683 0.0 0.000 0.0 2.500 7.0
Public Transport 420.0 1.683333 2.817810 0.0 0.000 0.0 4.000 9.0
Affordability (Rental) 420.0 1.400000 2.336184 0.0 0.000 0.0 3.250 7.0
Affordability (Buying) 420.0 1.102381 1.965452 0.0 0.000 0.0 2.000 7.0
Nature 420.0 1.876190 3.104221 0.0 0.000 0.0 5.000 10.0
Noise 420.0 1.680952 2.849967 0.0 0.000 0.0 4.000 10.0
Things to See/Do 420.0 1.585714 2.765244 0.0 0.000 0.0 3.000 10.0
Family-Friendliness 420.0 2.071429 3.393050 0.0 0.000 0.0 5.000 10.0
Pet Friendliness 420.0 2.057143 3.366488 0.0 0.000 0.0 5.250 9.0
Safety 420.0 2.045238 3.376226 0.0 0.000 0.0 5.000 10.0
Overall Rating 420.0 1.659810 2.675546 0.0 0.000 0.0 5.125 7.1

Missing data¶

In [37]:
#dataset 1
print('Counting missing data for each feature')
data_price.isnull().sum()
Counting missing data for each feature
Out[37]:
price                       0
date_sold                   0
suburb                      0
num_bath                    0
num_bed                     0
num_parking                 0
property_size               0
type                        0
suburb_population           0
suburb_median_income        0
suburb_sqkm                 0
suburb_lat                  0
suburb_lng                  0
suburb_elevation            0
cash_rate                   0
property_inflation_index    0
km_from_cbd                 0
dtype: int64

in this they are counting missing data for each of the feature. so as we can see there are different missing data. there are no mmissing data and also there are many of missing data also.

In [38]:
#dataset 2
print('Counting missing data for each feature')
data_ethnic.isnull().sum()
Counting missing data for each feature
Out[38]:
Name                                               0
Region                                             0
Population (rounded)*                              0
Postcode                                           0
Ethnic Breakdown 2016                              0
Median House Price (2020)                          1
Median House Price (2021)                         12
% Change                                           0
Median House Rent (per week)                      19
Median Apartment Price (2020)                    122
Median Apartment Rent (per week)                  71
Public Housing %                                   2
Avg. Years Held                                   51
Time to CBD (Public Transport) [Town Hall St]      4
Time to CBD (Driving) [Town Hall St]               2
Nearest Train Station                             71
Highlights/Attractions                           307
Ideal for                                        305
Traffic                                            2
Public Transport                                   1
Affordability (Rental)                             1
Affordability (Buying)                             1
Nature                                             1
Noise                                              1
Things to See/Do                                   1
Family-Friendliness                                1
Pet Friendliness                                   1
Safety                                             1
Overall Rating                                     1
Review Link                                      307
dtype: int64

Visualizing missing data¶

Here Dataset2 is showing the visual image of the missing data. in form of number vertically and different factors horizontally

In [39]:
print('Visualizing missing data')
msno.bar(data_ethnic)
Visualizing missing data
Out[39]:
<AxesSubplot:>

Data Cleaning¶

Cleaning Dataset2¶

in this the overall data cleaning has been started with removing properties other than houses, removing suburbs having postal code except those start with 2 having four numbers. replacing missing values by zero. handing outliers which are lied outside the zone. and also at the end creating a new coloumn for analysing.

Data set 2 has been claned the same way as above. just leaving rounded population of suburbs. average years held, median prices, things to see, family frindlinesss, safety etc...

Cleaning ethnic data with different languages which includes chinese speaking english australian irish and many more.

In [40]:
clean_data_ethnic=data_ethnic.drop(['Population (rounded)*','Avg. Years Held','Median House Price (2020)','Median House Price (2021)','% Change','Median House Rent (per week)','Median Apartment Price (2020)','Median Apartment Rent (per week)','Highlights/Attractions','Time to CBD (Public Transport) [Town Hall St]','Time to CBD (Driving) [Town Hall St]','Public Housing %','Median House Price (2020)','Median House Price (2021)','% Change','Median Apartment Price (2020)','Ideal for','Traffic','Public Transport','Affordability (Rental)','Affordability (Buying)',
            'Nature','Noise','Things to See/Do','Family-Friendliness','Pet Friendliness','Safety','Review Link', 'Overall Rating','Nearest Train Station'], axis=1)
clean_data_ethnic
Out[40]:
Name Region Postcode Ethnic Breakdown 2016
0 Hornsby Upper North Shore 2077 Chinese 17.1%, English 16.8%, Australian 14.0%...
1 Oatley Southern Suburbs 2223 English 23.0%, Australian 21.1%, Chinese 9.8%,...
2 Dulwich Hill Inner West 2203 English 19.4%, Australian 16.4%, Irish 9.5%, S...
3 Jannali Sutherland Shire 2226 English 28.2%, Australian 26.3%, Irish 9.8%, S...
4 Waverton Lower North Shore 2060 English 24.9%, Australian 15.5%, Irish 11.0%, ...
... ... ... ... ...
416 Kyeemagh Southern Suburbs 2216 Greek 19.7%, Australian 14.0%, English 10.0%, ...
417 Northwood Lower North Shore 2066 English 26.4%, Australian 23.0%, Irish 12.3%, ...
418 Watsons Bay Eastern Suburbs 2030 English 26.0%, Australian 19.0%, Irish 11.2%, ...
419 Lavender Bay Lower North Shore 2060 English 25.8%, Australian 15.1%, Irish 13.7%, ...
420 Bexley North Southern Suburbs 2207 Greek 17.5%, Chinese 13.3%, Australian 12.0%, ...

421 rows × 4 columns

In [41]:
type(clean_data_ethnic)
Out[41]:
pandas.core.frame.DataFrame

Rename column for consistency¶

In [45]:
clean_data_ethnic.rename(columns={'Name':'Suburb', 'Ethnic Breakdown 2016':'Ethnic'}, inplace=True)
cethnic=clean_data_ethnic.copy()
cethnic
Out[45]:
Suburb Region Postcode Ethnic
0 Hornsby Upper North Shore 2077 Chinese 17.1%, English 16.8%, Australian 14.0%...
1 Oatley Southern Suburbs 2223 English 23.0%, Australian 21.1%, Chinese 9.8%,...
2 Dulwich Hill Inner West 2203 English 19.4%, Australian 16.4%, Irish 9.5%, S...
3 Jannali Sutherland Shire 2226 English 28.2%, Australian 26.3%, Irish 9.8%, S...
4 Waverton Lower North Shore 2060 English 24.9%, Australian 15.5%, Irish 11.0%, ...
... ... ... ... ...
416 Kyeemagh Southern Suburbs 2216 Greek 19.7%, Australian 14.0%, English 10.0%, ...
417 Northwood Lower North Shore 2066 English 26.4%, Australian 23.0%, Irish 12.3%, ...
418 Watsons Bay Eastern Suburbs 2030 English 26.0%, Australian 19.0%, Irish 11.2%, ...
419 Lavender Bay Lower North Shore 2060 English 25.8%, Australian 15.1%, Irish 13.7%, ...
420 Bexley North Southern Suburbs 2207 Greek 17.5%, Chinese 13.3%, Australian 12.0%, ...

421 rows × 4 columns

Splitting column 'Ethnic'¶
In [46]:
ethic_english=[]
for ethnic in cethnic['Ethnic']:
    num = re.search(r'English'+'\s+(\d*\.\d+|\d+)', ethnic)
    ethic_english.append(str(num))

ethic_australian=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'Australian\s+(\d*\.\d+|\d+)', ethinc)
    ethic_australian.append(str(num))

ethic_chinese=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'Chinese\s+(\d*\.\d+|\d+)', ethinc)
    ethic_chinese.append(str(num)) 
ethic_greek=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'Greek\s+(\d*\.\d+|\d+)', ethinc)
    ethic_greek.append(str(num))
ethic_irish=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'Irish\s+(\d*\.\d+|\d+)', ethinc)
    ethic_irish.append(str(num))
ethic_indian=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'Indian\s+(\d*\.\d+|\d+)', ethinc)
    ethic_indian.append(str(num))
ethic_scottish=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'Scottish\s+(\d*\.\d+|\d+)', ethinc)
    ethic_scottish.append(str(num))
ethic_italian=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'Italian\s+(\d*\.\d+|\d+)', ethinc)
    ethic_italian.append(str(num))
ethic_nepalese=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'Nepalese\s+(\d*\.\d+|\d+)', ethinc)
    ethic_nepalese.append(str(num))
ethic_korean=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'Korean\s+(\d*\.\d+|\d+)', ethinc)
    ethic_korean.append(str(num))
ethic_lebanese=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'Lebanese\s+(\d*\.\d+|\d+)', ethinc)
    ethic_lebanese.append(str(num))
ethic_mongolian=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'Mongolian\s+(\d*\.\d+|\d+)', ethinc)
    ethic_mongolian.append(str(num))
ethic_chineseScottish=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'ChineseScottish\s+(\d*\.\d+|\d+)', ethinc)
    ethic_chineseScottish.append(str(num))
ethic_vietnamese=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'Vietnamese\s+(\d*\.\d+|\d+)', ethinc)
    ethic_vietnamese.append(str(num))
ethic_thai=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'Thai\s+(\d*\.\d+|\d+)', ethinc)
    ethic_thai.append(str(num))
    ethic_filipino=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'Filipino\s+(\d*\.\d+|\d+)', ethinc)
    ethic_filipino.append(str(num))

    ethic_turkish=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'Turkish\s+(\d*\.\d+|\d+)', ethinc)
    ethic_turkish.append(str(num))
    ethic_iraqi=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'Iraqi\s+(\d*\.\d+|\d+)', ethinc)
    ethic_iraqi.append(str(num))
    ethic_maltese=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'Maltese\s+(\d*\.\d+|\d+)', ethinc)
    ethic_maltese.append(str(num))
    ethic_khmerCambodian=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'KhmerCambodian\s+(\d*\.\d+|\d+)', ethinc)
    ethic_khmerCambodian.append(str(num))
    ethic_assyrian=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'Assyrian\s+(\d*\.\d+|\d+)', ethinc)
    ethic_assyrian.append(str(num))
    ethic_bangladeshi=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'Bangladeshi\s+(\d*\.\d+|\d+)', ethinc)
    ethic_bangladeshi.append(str(num))
    ethic_indonesian=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'Indonesian\s+(\d*\.\d+|\d+)', ethinc)
    ethic_indonesian.append(str(num))
    ethic_sriLankan=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'SriLankan\s+(\d*\.\d+|\d+)', ethinc)
    ethic_sriLankan.append(str(num))
    ethic_samoan=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'Samoan\s+(\d*\.\d+|\d+)', ethinc)
    ethic_samoan.append(str(num))
    ethic_german=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'German\s+(\d*\.\d+|\d+)', ethinc)
    ethic_german.append(str(num))
    ethic_scottishChinese=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'ScottishChinese\s+(\d*\.\d+|\d+)', ethinc)
    ethic_scottishChinese.append(str(num))
    ethic_macedonian=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'Macedonian\s+(\d*\.\d+|\d+)', ethinc)
    ethic_macedonian.append(str(num))
    ethic_afghann=[]
for ethinc in cethnic['Ethnic']:
    num = re.search(r'Afghan\s+(\d*\.\d+|\d+)', ethinc)
    ethic_afghann.append(str(num))
In [47]:
# add 2 new columns to the dataframe
cethnic['English'] = ethic_english
cethnic['Australian'] = ethic_australian
cethnic['Chinese'] = ethic_chinese
cethnic['Greek'] = ethic_greek
cethnic['Irish'] = ethic_irish
cethnic['Indian'] = ethic_indian

cethnic['Scottish']=ethic_scottish

cethnic['Italian']=ethic_italian 
 
cethnic['Nepalese']=ethic_nepalese
 
cethnic['Korean']=ethic_korean

cethnic['Lebanese']=ethic_lebanese

cethnic['Mongolian']=ethic_mongolian

cethnic['ChineseScottish']=ethic_chineseScottish

cethnic['Vietnamese']=ethic_vietnamese

cethnic['Thai']=ethic_thai

cethnic['Filipino']=   ethic_filipino


cethnic['Turkish']=    ethic_turkish

cethnic['Iraqi']=    ethic_iraqi

cethnic['Maltese']=    ethic_maltese

cethnic['KhmerCambodian']=    ethic_khmerCambodian

cethnic['Assyrian']=    ethic_assyrian

cethnic['Bangladeshi']=    ethic_bangladeshi

cethnic['Indonesian']=    ethic_indonesian

cethnic['SriLankan']=    ethic_sriLankan

cethnic['Samoan']=    ethic_samoan

cethnic['German']=    ethic_german

cethnic['ScottishChinese']=    ethic_scottishChinese

cethnic['Macedonian']=    ethic_macedonian

cethnic['Afghan']=    ethic_afghann

cethnic
Out[47]:
Suburb Region Postcode Ethnic English Australian Chinese Greek Irish Indian ... KhmerCambodian Assyrian Bangladeshi Indonesian SriLankan Samoan German ScottishChinese Macedonian Afghan
0 Hornsby Upper North Shore 2077 Chinese 17.1%, English 16.8%, Australian 14.0%... <re.Match object; span=(15, 27), match='Englis... <re.Match object; span=(30, 45), match='Austra... <re.Match object; span=(0, 12), match='Chinese... None <re.Match object; span=(61, 70), match='Irish ... <re.Match object; span=(48, 58), match='Indian... ... None None None None None None None None None None
1 Oatley Southern Suburbs 2223 English 23.0%, Australian 21.1%, Chinese 9.8%,... <re.Match object; span=(0, 12), match='English... <re.Match object; span=(15, 30), match='Austra... <re.Match object; span=(33, 44), match='Chines... None <re.Match object; span=(47, 56), match='Irish ... None ... None None None None None None None None None None
2 Dulwich Hill Inner West 2203 English 19.4%, Australian 16.4%, Irish 9.5%, S... <re.Match object; span=(0, 12), match='English... <re.Match object; span=(15, 30), match='Austra... None <re.Match object; span=(60, 69), match='Greek ... <re.Match object; span=(33, 42), match='Irish ... None ... None None None None None None None None None None
3 Jannali Sutherland Shire 2226 English 28.2%, Australian 26.3%, Irish 9.8%, S... <re.Match object; span=(0, 12), match='English... <re.Match object; span=(15, 30), match='Austra... <re.Match object; span=(60, 71), match='Chines... None <re.Match object; span=(33, 42), match='Irish ... None ... None None None None None None None None None None
4 Waverton Lower North Shore 2060 English 24.9%, Australian 15.5%, Irish 11.0%, ... <re.Match object; span=(0, 12), match='English... <re.Match object; span=(15, 30), match='Austra... <re.Match object; span=(46, 57), match='Chines... None <re.Match object; span=(33, 43), match='Irish ... None ... None None None None None None None None None None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
416 Kyeemagh Southern Suburbs 2216 Greek 19.7%, Australian 14.0%, English 10.0%, ... <re.Match object; span=(31, 43), match='Englis... <re.Match object; span=(13, 28), match='Austra... None <re.Match object; span=(0, 10), match='Greek 1... None None ... None None None None None None None None None None
417 Northwood Lower North Shore 2066 English 26.4%, Australian 23.0%, Irish 12.3%, ... <re.Match object; span=(0, 12), match='English... <re.Match object; span=(15, 30), match='Austra... <re.Match object; span=(61, 72), match='Chines... None <re.Match object; span=(33, 43), match='Irish ... None ... None None None None None None None None None None
418 Watsons Bay Eastern Suburbs 2030 English 26.0%, Australian 19.0%, Irish 11.2%, ... <re.Match object; span=(0, 12), match='English... <re.Match object; span=(15, 30), match='Austra... None None <re.Match object; span=(33, 43), match='Irish ... None ... None None None None None None <re.Match object; span=(61, 71), match='German... None None None
419 Lavender Bay Lower North Shore 2060 English 25.8%, Australian 15.1%, Irish 13.7%, ... <re.Match object; span=(0, 12), match='English... <re.Match object; span=(15, 30), match='Austra... <re.Match object; span=(61, 72), match='Chines... None <re.Match object; span=(33, 43), match='Irish ... None ... None None None None None None None None None None
420 Bexley North Southern Suburbs 2207 Greek 17.5%, Chinese 13.3%, Australian 12.0%, ... <re.Match object; span=(46, 58), match='Englis... <re.Match object; span=(28, 43), match='Austra... <re.Match object; span=(13, 25), match='Chines... <re.Match object; span=(0, 10), match='Greek 1... None None ... None None None None None None None None None None

421 rows × 33 columns

In [48]:
cethnic['English'] = cethnic['English'].str[-6:-2]
cethnic['Australian'] = cethnic['Australian'].str[-6:-2]
cethnic['Chinese'] = cethnic['Chinese'].str[-6:-2]
cethnic['Greek'] = cethnic['Greek'] .str[-6:-2]
cethnic['Irish'] = cethnic['Irish'].str[-6:-2]
cethnic['Indian'] = cethnic['Indian'].str[-6:-2]
cethnic['Scottish']=cethnic['Scottish'].str[-6:-2]

cethnic['Italian']=cethnic['Italian'].str[-6:-2]
 
cethnic['Nepalese']=cethnic['Nepalese'].str[-6:-2]
 
cethnic['Korean']=cethnic['Korean'].str[-6:-2]

cethnic['Lebanese']=cethnic['Lebanese'].str[-6:-2]

cethnic['Mongolian']=cethnic['Mongolian'].str[-6:-2]

cethnic['ChineseScottish']=cethnic['ChineseScottish'].str[-6:-2]

cethnic['Vietnamese']=cethnic['Vietnamese'].str[-6:-2]

cethnic['Thai']=cethnic['Thai'].str[-6:-2]

cethnic['Filipino']=cethnic['Filipino'].str[-6:-2]


cethnic['Turkish']=cethnic['Turkish'].str[-6:-2]

cethnic['Iraqi']=cethnic['Iraqi'].str[-6:-2]

cethnic['Maltese']=cethnic['Maltese'].str[-6:-2]

cethnic['KhmerCambodian']= cethnic['KhmerCambodian'].str[-6:-2]

cethnic['Assyrian']=cethnic['Assyrian'].str[-6:-2]

cethnic['Bangladeshi']=cethnic['Bangladeshi'].str[-6:-2]

cethnic['Indonesian']=cethnic['Indonesian'].str[-6:-2]

cethnic['SriLankan']=cethnic['SriLankan'].str[-6:-2]

cethnic['Samoan']=cethnic['Samoan'].str[-6:-2]

cethnic['German']=cethnic['German'].str[-6:-2]

cethnic['ScottishChinese']=cethnic['ScottishChinese'].str[-6:-2]
cethnic['Macedonian']=cethnic['Macedonian'].str[-6:-2]

cethnic['Afghan']=cethnic['Afghan'].str[-6:-2]
cethnic
Out[48]:
Suburb Region Postcode Ethnic English Australian Chinese Greek Irish Indian ... KhmerCambodian Assyrian Bangladeshi Indonesian SriLankan Samoan German ScottishChinese Macedonian Afghan
0 Hornsby Upper North Shore 2077 Chinese 17.1%, English 16.8%, Australian 14.0%... 16.8 14.0 17.1 No 5.6 5.9 ... No No No No No No No No No No
1 Oatley Southern Suburbs 2223 English 23.0%, Australian 21.1%, Chinese 9.8%,... 23.0 21.1 9.8 No 8.9 No ... No No No No No No No No No No
2 Dulwich Hill Inner West 2203 English 19.4%, Australian 16.4%, Irish 9.5%, S... 19.4 16.4 No 5.2 9.5 No ... No No No No No No No No No No
3 Jannali Sutherland Shire 2226 English 28.2%, Australian 26.3%, Irish 9.8%, S... 28.2 26.3 3.0 No 9.8 No ... No No No No No No No No No No
4 Waverton Lower North Shore 2060 English 24.9%, Australian 15.5%, Irish 11.0%, ... 24.9 15.5 8.4 No 11.0 No ... No No No No No No No No No No
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
416 Kyeemagh Southern Suburbs 2216 Greek 19.7%, Australian 14.0%, English 10.0%, ... 10.0 14.0 No 19.7 No No ... No No No No No No No No No No
417 Northwood Lower North Shore 2066 English 26.4%, Australian 23.0%, Irish 12.3%, ... 26.4 23.0 7.0 No 12.3 No ... No No No No No No No No No No
418 Watsons Bay Eastern Suburbs 2030 English 26.0%, Australian 19.0%, Irish 11.2%, ... 26.0 19.0 No No 11.2 No ... No No No No No No 3.9 No No No
419 Lavender Bay Lower North Shore 2060 English 25.8%, Australian 15.1%, Irish 13.7%, ... 25.8 15.1 4.5 No 13.7 No ... No No No No No No No No No No
420 Bexley North Southern Suburbs 2207 Greek 17.5%, Chinese 13.3%, Australian 12.0%, ... 10.6 12.0 13.3 17.5 No No ... No No No No No No No No No No

421 rows × 33 columns

In [49]:
#cethnic['Population'] = cethnic['Population'].apply(pd.to_numeric,errors='coerce')
cethnic['English'] = cethnic['English'].apply(pd.to_numeric,errors='coerce')
cethnic['Australian'] = cethnic['Australian'].apply(pd.to_numeric,errors='coerce')
cethnic['Chinese'] = cethnic['Chinese'].apply(pd.to_numeric,errors='coerce')
cethnic['Greek'] = cethnic['Greek'].apply(pd.to_numeric,errors='coerce')
cethnic['Irish'] = cethnic['Irish'].apply(pd.to_numeric,errors='coerce')
cethnic['Indian'] = cethnic['Indian'].apply(pd.to_numeric,errors='coerce')
cethnic['Scottish']=cethnic['Scottish'].apply(pd.to_numeric,errors='coerce')

cethnic['Italian']=cethnic['Italian'].apply(pd.to_numeric,errors='coerce')
 
cethnic['Nepalese']=cethnic['Nepalese'].apply(pd.to_numeric,errors='coerce')
 
cethnic['Korean']=cethnic['Korean'].apply(pd.to_numeric,errors='coerce')

cethnic['Lebanese']=cethnic['Lebanese'].apply(pd.to_numeric,errors='coerce')

cethnic['Mongolian']=cethnic['Mongolian'].apply(pd.to_numeric,errors='coerce')

cethnic['ChineseScottish']=cethnic['ChineseScottish'].apply(pd.to_numeric,errors='coerce')

cethnic['Vietnamese']=cethnic['Vietnamese'].apply(pd.to_numeric,errors='coerce')

cethnic['Thai']=cethnic['Thai'].apply(pd.to_numeric,errors='coerce')

cethnic['Filipino']=cethnic['Filipino'].apply(pd.to_numeric,errors='coerce')


cethnic['Turkish']=cethnic['Turkish'].apply(pd.to_numeric,errors='coerce')

cethnic['Iraqi']=cethnic['Iraqi'].apply(pd.to_numeric,errors='coerce')

cethnic['Maltese']=cethnic['Maltese'].apply(pd.to_numeric,errors='coerce')

cethnic['KhmerCambodian']= cethnic['KhmerCambodian'].apply(pd.to_numeric,errors='coerce')

cethnic['Assyrian']=cethnic['Assyrian'].apply(pd.to_numeric,errors='coerce')

cethnic['Bangladeshi']=cethnic['Bangladeshi'].apply(pd.to_numeric,errors='coerce')

cethnic['Indonesian']=cethnic['Indonesian'].apply(pd.to_numeric,errors='coerce')

cethnic['SriLankan']=cethnic['SriLankan'].apply(pd.to_numeric,errors='coerce')

cethnic['Samoan']=cethnic['Samoan'].apply(pd.to_numeric,errors='coerce')

cethnic['German']=cethnic['German'].apply(pd.to_numeric,errors='coerce')

cethnic['ScottishChinese']=cethnic['ScottishChinese'].apply(pd.to_numeric,errors='coerce')
cethnic['Macedonian']=cethnic['Macedonian'].apply(pd.to_numeric,errors='coerce')

cethnic['Afghan']=cethnic['Afghan'].apply(pd.to_numeric,errors='coerce')
#cethnic=cethnic.convert_dtypes()
cethnic.info()
cethnic
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 421 entries, 0 to 420
Data columns (total 33 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Suburb           421 non-null    object 
 1   Region           421 non-null    object 
 2   Postcode         421 non-null    int64  
 3   Ethnic           421 non-null    object 
 4   English          417 non-null    float64
 5   Australian       419 non-null    float64
 6   Chinese          235 non-null    float64
 7   Greek            42 non-null     float64
 8   Irish            316 non-null    float64
 9   Indian           72 non-null     float64
 10  Scottish         241 non-null    float64
 11  Italian          103 non-null    float64
 12  Nepalese         8 non-null      float64
 13  Korean           20 non-null     float64
 14  Lebanese         61 non-null     float64
 15  Mongolian        1 non-null      float64
 16  ChineseScottish  0 non-null      float64
 17  Vietnamese       30 non-null     float64
 18  Thai             2 non-null      float64
 19  Filipino         30 non-null     float64
 20  Turkish          1 non-null      float64
 21  Iraqi            3 non-null      float64
 22  Maltese          20 non-null     float64
 23  KhmerCambodian   0 non-null      float64
 24  Assyrian         9 non-null      float64
 25  Bangladeshi      6 non-null      float64
 26  Indonesian       1 non-null      float64
 27  SriLankan        0 non-null      float64
 28  Samoan           12 non-null     float64
 29  German           31 non-null     float64
 30  ScottishChinese  0 non-null      float64
 31  Macedonian       4 non-null      float64
 32  Afghan           2 non-null      float64
dtypes: float64(29), int64(1), object(3)
memory usage: 108.7+ KB
Out[49]:
Suburb Region Postcode Ethnic English Australian Chinese Greek Irish Indian ... KhmerCambodian Assyrian Bangladeshi Indonesian SriLankan Samoan German ScottishChinese Macedonian Afghan
0 Hornsby Upper North Shore 2077 Chinese 17.1%, English 16.8%, Australian 14.0%... 16.8 14.0 17.1 NaN 5.6 5.9 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Oatley Southern Suburbs 2223 English 23.0%, Australian 21.1%, Chinese 9.8%,... 23.0 21.1 9.8 NaN 8.9 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Dulwich Hill Inner West 2203 English 19.4%, Australian 16.4%, Irish 9.5%, S... 19.4 16.4 NaN 5.2 9.5 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Jannali Sutherland Shire 2226 English 28.2%, Australian 26.3%, Irish 9.8%, S... 28.2 26.3 3.0 NaN 9.8 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Waverton Lower North Shore 2060 English 24.9%, Australian 15.5%, Irish 11.0%, ... 24.9 15.5 8.4 NaN 11.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
416 Kyeemagh Southern Suburbs 2216 Greek 19.7%, Australian 14.0%, English 10.0%, ... 10.0 14.0 NaN 19.7 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
417 Northwood Lower North Shore 2066 English 26.4%, Australian 23.0%, Irish 12.3%, ... 26.4 23.0 7.0 NaN 12.3 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
418 Watsons Bay Eastern Suburbs 2030 English 26.0%, Australian 19.0%, Irish 11.2%, ... 26.0 19.0 NaN NaN 11.2 NaN ... NaN NaN NaN NaN NaN NaN 3.9 NaN NaN NaN
419 Lavender Bay Lower North Shore 2060 English 25.8%, Australian 15.1%, Irish 13.7%, ... 25.8 15.1 4.5 NaN 13.7 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
420 Bexley North Southern Suburbs 2207 Greek 17.5%, Chinese 13.3%, Australian 12.0%, ... 10.6 12.0 13.3 17.5 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

421 rows × 33 columns

In [50]:
data_ethnic_clean=cethnic.drop(['Ethnic'],axis='columns')
data_ethnic_clean.head()
Out[50]:
Suburb Region Postcode English Australian Chinese Greek Irish Indian Scottish ... KhmerCambodian Assyrian Bangladeshi Indonesian SriLankan Samoan German ScottishChinese Macedonian Afghan
0 Hornsby Upper North Shore 2077 16.8 14.0 17.1 NaN 5.6 5.9 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Oatley Southern Suburbs 2223 23.0 21.1 9.8 NaN 8.9 NaN 5.7 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Dulwich Hill Inner West 2203 19.4 16.4 NaN 5.2 9.5 NaN 6.2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Jannali Sutherland Shire 2226 28.2 26.3 3.0 NaN 9.8 NaN 6.5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Waverton Lower North Shore 2060 24.9 15.5 8.4 NaN 11.0 NaN 8.1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 32 columns

here we have to find any duplicate rows but as you see there are no duplications.

Duplicate Data¶

In [51]:
#cethnic[cethnic.duplicated(["Suburb",'Region','Postcode','Greek'], keep=False)]
dup_ethnic = cethnic.duplicated()
print('Number of duplicate rows = %d' % (dup_ethnic.sum()))
Number of duplicate rows = 0

Here we tried to find duplicate values in term of the price but there are no duplications.

In [52]:
dups = data_price.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
data_price.loc[[0,100]]
Number of duplicate rows = 0
Out[52]:
price date_sold suburb num_bath num_bed num_parking property_size type suburb_population suburb_median_income suburb_sqkm suburb_lat suburb_lng suburb_elevation cash_rate property_inflation_index km_from_cbd
0 530000 13/1/16 Kincumber 4 4 2 1351 House 7093 29432 9.914 -33.47252 151.40208 24 2.0 150.9 47.05
100 3100000 27/2/16 Point Clare 4 6 2 2245 House 3731 31772 3.336 -33.44017 151.31715 62 2.0 150.9 48.17

Data merging¶

Here both the two of the data set has been merged and we got a huge data combined. inlcuing 7000 rows and 48 coloums. from here we got the data of different suburb prices, data sold, property sizes. population, suburb median income, and the suburb lies within which region and most important each and every ethic group people which are English speaking, Austrlian, Irish, Indian and many more.

In [53]:
data_merge = data_price.merge(data_ethnic_clean[:], left_on='suburb',
                  right_on='Suburb').drop('Suburb', axis='columns')
data_merge
Out[53]:
price date_sold suburb num_bath num_bed num_parking property_size type suburb_population suburb_median_income ... KhmerCambodian Assyrian Bangladeshi Indonesian SriLankan Samoan German ScottishChinese Macedonian Afghan
0 452000 13/1/16 Leumeah 1 3 1 344 House 9835 32292 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 495000 15/1/16 Leumeah 1 3 2 582 House 9835 32292 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 473000 19/1/16 Leumeah 1 3 3 581 House 9835 32292 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 520000 20/1/16 Leumeah 1 3 1 651 House 9835 32292 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 585000 22/9/17 Leumeah 1 3 1 436 House 9835 32292 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7106 1900000 16/11/21 Milsons Point 2 2 1 124 Apartment / Unit / Flat 2158 87360 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7107 1000000 21/11/21 Eastern Creek 2 5 4 601 House 827 33852 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7108 1065000 13/12/21 Eastern Creek 3 3 2 949 House 827 33852 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7109 1000000 13/12/21 Eastern Creek 2 4 2 910 House 827 33852 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7110 1115000 15/12/21 Sadleir 2 4 2 639 House 3135 21060 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

7111 rows × 48 columns

Show all columns

We´ll drop this one, it carries no relevant information at all.

In [54]:
pd.set_option('display.max_columns', None)
In [55]:
data_merge.drop(['Region','Postcode'],axis=1)
Out[55]:
price date_sold suburb num_bath num_bed num_parking property_size type suburb_population suburb_median_income suburb_sqkm suburb_lat suburb_lng suburb_elevation cash_rate property_inflation_index km_from_cbd English Australian Chinese Greek Irish Indian Scottish Italian Nepalese Korean Lebanese Mongolian ChineseScottish Vietnamese Thai Filipino Turkish Iraqi Maltese KhmerCambodian Assyrian Bangladeshi Indonesian SriLankan Samoan German ScottishChinese Macedonian Afghan
0 452000 13/1/16 Leumeah 1 3 1 344 House 9835 32292 4.055 -34.05375 150.83957 81 2.0 150.9 40.12 22.7 23.0 NaN NaN 6.8 3.4 5.6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 495000 15/1/16 Leumeah 1 3 2 582 House 9835 32292 4.055 -34.05375 150.83957 81 2.0 150.9 40.12 22.7 23.0 NaN NaN 6.8 3.4 5.6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 473000 19/1/16 Leumeah 1 3 3 581 House 9835 32292 4.055 -34.05375 150.83957 81 2.0 150.9 40.12 22.7 23.0 NaN NaN 6.8 3.4 5.6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 520000 20/1/16 Leumeah 1 3 1 651 House 9835 32292 4.055 -34.05375 150.83957 81 2.0 150.9 40.12 22.7 23.0 NaN NaN 6.8 3.4 5.6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 585000 22/9/17 Leumeah 1 3 1 436 House 9835 32292 4.055 -34.05375 150.83957 81 1.5 174.0 40.12 22.7 23.0 NaN NaN 6.8 3.4 5.6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7106 1900000 16/11/21 Milsons Point 2 2 1 124 Apartment / Unit / Flat 2158 87360 0.198 -33.84723 151.21163 42 0.1 220.1 1.99 20.3 9.7 15.8 NaN 8.1 NaN 6.2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7107 1000000 21/11/21 Eastern Creek 2 5 4 601 House 827 33852 20.431 -33.79734 150.84976 63 0.1 220.1 34.17 21.6 22.7 NaN NaN 5.7 3.5 NaN NaN NaN NaN NaN NaN NaN NaN NaN 6.9 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7108 1065000 13/12/21 Eastern Creek 3 3 2 949 House 827 33852 20.431 -33.79734 150.84976 63 0.1 220.1 34.17 21.6 22.7 NaN NaN 5.7 3.5 NaN NaN NaN NaN NaN NaN NaN NaN NaN 6.9 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7109 1000000 13/12/21 Eastern Creek 2 4 2 910 House 827 33852 20.431 -33.79734 150.84976 63 0.1 220.1 34.17 21.6 22.7 NaN NaN 5.7 3.5 NaN NaN NaN NaN NaN NaN NaN NaN NaN 6.9 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7110 1115000 15/12/21 Sadleir 2 4 2 639 House 3135 21060 0.899 -33.91680 150.89054 25 0.1 220.1 30.09 15.1 17.7 NaN NaN 4.3 NaN NaN NaN NaN NaN 14.1 NaN NaN 9.1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

7111 rows × 46 columns

Fill null to 0¶

In [56]:
#fill null to 0
data_merge = data_merge.fillna(0)
data_merge
Out[56]:
price date_sold suburb num_bath num_bed num_parking property_size type suburb_population suburb_median_income suburb_sqkm suburb_lat suburb_lng suburb_elevation cash_rate property_inflation_index km_from_cbd Region Postcode English Australian Chinese Greek Irish Indian Scottish Italian Nepalese Korean Lebanese Mongolian ChineseScottish Vietnamese Thai Filipino Turkish Iraqi Maltese KhmerCambodian Assyrian Bangladeshi Indonesian SriLankan Samoan German ScottishChinese Macedonian Afghan
0 452000 13/1/16 Leumeah 1 3 1 344 House 9835 32292 4.055 -34.05375 150.83957 81 2.0 150.9 40.12 South West 2560 22.7 23.0 0.0 0.0 6.8 3.4 5.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 495000 15/1/16 Leumeah 1 3 2 582 House 9835 32292 4.055 -34.05375 150.83957 81 2.0 150.9 40.12 South West 2560 22.7 23.0 0.0 0.0 6.8 3.4 5.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 473000 19/1/16 Leumeah 1 3 3 581 House 9835 32292 4.055 -34.05375 150.83957 81 2.0 150.9 40.12 South West 2560 22.7 23.0 0.0 0.0 6.8 3.4 5.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 520000 20/1/16 Leumeah 1 3 1 651 House 9835 32292 4.055 -34.05375 150.83957 81 2.0 150.9 40.12 South West 2560 22.7 23.0 0.0 0.0 6.8 3.4 5.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 585000 22/9/17 Leumeah 1 3 1 436 House 9835 32292 4.055 -34.05375 150.83957 81 1.5 174.0 40.12 South West 2560 22.7 23.0 0.0 0.0 6.8 3.4 5.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7106 1900000 16/11/21 Milsons Point 2 2 1 124 Apartment / Unit / Flat 2158 87360 0.198 -33.84723 151.21163 42 0.1 220.1 1.99 Lower North Shore 2061 20.3 9.7 15.8 0.0 8.1 0.0 6.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7107 1000000 21/11/21 Eastern Creek 2 5 4 601 House 827 33852 20.431 -33.79734 150.84976 63 0.1 220.1 34.17 Western Suburbs 2766 21.6 22.7 0.0 0.0 5.7 3.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7108 1065000 13/12/21 Eastern Creek 3 3 2 949 House 827 33852 20.431 -33.79734 150.84976 63 0.1 220.1 34.17 Western Suburbs 2766 21.6 22.7 0.0 0.0 5.7 3.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7109 1000000 13/12/21 Eastern Creek 2 4 2 910 House 827 33852 20.431 -33.79734 150.84976 63 0.1 220.1 34.17 Western Suburbs 2766 21.6 22.7 0.0 0.0 5.7 3.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7110 1115000 15/12/21 Sadleir 2 4 2 639 House 3135 21060 0.899 -33.91680 150.89054 25 0.1 220.1 30.09 Western Suburbs 2168 15.1 17.7 0.0 0.0 4.3 0.0 0.0 0.0 0.0 0.0 14.1 0.0 0.0 9.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

7111 rows × 48 columns

In [57]:
data_price[(data_price['num_bath']==0) | (data_price['num_bed']==0)]
Out[57]:
price date_sold suburb num_bath num_bed num_parking property_size type suburb_population suburb_median_income suburb_sqkm suburb_lat suburb_lng suburb_elevation cash_rate property_inflation_index km_from_cbd
4 365500 13/1/16 North Avoca 0 0 0 1850 Vacant land 2200 45084 1.497 -33.45608 151.43598 18 2.0 150.9 49.98
9 242500 15/1/16 Winmalee 0 0 0 1248 Vacant land 6202 38740 9.058 -33.67971 150.61119 263 2.0 150.9 59.15
67 320000 17/2/16 Rosemeadow 0 0 0 592 Vacant land 7862 32864 2.976 -34.10624 150.79181 149 2.0 150.9 46.98
99 225000 26/2/16 Chittaway Bay 0 1 0 486 Vacant land 2028 31668 1.116 -33.32678 151.44557 3 2.0 150.9 63.59
154 600000 21/3/16 Kingswood 0 0 0 871 Vacant land 9301 33228 4.782 -33.76129 150.72784 42 2.0 155.2 46.09
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
11068 1380000 22/12/21 Caringbah 0 3 1 210 Townhouse 11658 42380 4.089 -34.03534 151.12423 22 0.1 220.1 20.47
11081 1100000 22/12/21 Glenmore Park 0 5 3 540 House 23004 45916 9.740 -33.79317 150.68049 54 0.1 220.1 49.65
11091 891000 22/12/21 Glenfield 0 3 1 605 House 9633 37024 6.995 -33.97099 150.89301 29 0.1 220.1 31.57
11092 670000 22/12/21 Carlton 0 2 1 109 Apartment / Unit / Flat 10722 32656 2.065 -33.97075 151.12140 36 0.1 220.1 14.29
11104 1600000 23/12/21 Kellyville 0 0 0 542 Vacant land 27971 46228 18.645 -33.69583 150.95622 78 0.1 220.1 30.08

205 rows × 17 columns

Correct data types¶

In [58]:
# Correct data types
data_merge['date_sold'] = pd.to_datetime(data_merge['date_sold'])

Analysis¶

In [59]:
test= data_merge.drop_duplicates(subset=["suburb"],keep='first')
test
Out[59]:
price date_sold suburb num_bath num_bed num_parking property_size type suburb_population suburb_median_income suburb_sqkm suburb_lat suburb_lng suburb_elevation cash_rate property_inflation_index km_from_cbd Region Postcode English Australian Chinese Greek Irish Indian Scottish Italian Nepalese Korean Lebanese Mongolian ChineseScottish Vietnamese Thai Filipino Turkish Iraqi Maltese KhmerCambodian Assyrian Bangladeshi Indonesian SriLankan Samoan German ScottishChinese Macedonian Afghan
0 452000 2016-01-13 Leumeah 1 3 1 344 House 9835 32292 4.055 -34.05375 150.83957 81 2.0 150.9 40.12 South West 2560 22.7 23.0 0.0 0.0 6.8 3.4 5.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
22 890000 2016-01-18 Picnic Point 2 4 3 715 House 6160 40560 3.859 -33.97301 151.00632 33 2.0 150.9 22.31 Western Suburbs 2213 20.7 23.1 0.0 4.9 8.2 0.0 5.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
55 533000 2016-01-18 Whalan 3 4 2 695 House 5973 24180 2.429 -33.75572 150.80361 37 2.0 150.9 39.53 Western Suburbs 2770 21.6 24.9 0.0 0.0 4.7 0.0 4.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4.2 0.0 0.0 0.0 0.0
56 1120500 2016-01-19 North Rocks 2 4 2 904 House 7965 40092 5.462 -33.77572 151.01474 92 2.0 150.9 20.61 Hills Shire 2151 20.1 19.4 14.9 0.0 6.8 0.0 5.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
86 675000 2016-01-19 Bass Hill 3 3 2 263 Townhouse 9069 24388 2.929 -33.90028 150.99309 40 2.0 150.9 20.43 Western Suburbs 2197 11.0 13.5 0.0 0.0 0.0 0.0 0.0 4.3 0.0 0.0 20.9 0.0 0.0 7.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7101 1320000 2021-10-19 Narellan 2 4 2 633 House 3616 35516 4.276 -34.04378 150.73456 91 0.1 220.1 48.20 South West 2567 29.0 31.5 0.0 0.0 8.1 0.0 6.3 3.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7104 1035501 2021-06-11 Kearns 3 5 2 963 House 2745 40092 2.359 -34.01979 150.80135 87 0.1 210.1 41.48 South West 2558 23.7 26.7 0.0 0.0 6.8 0.0 6.4 4.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7106 1900000 2021-11-16 Milsons Point 2 2 1 124 Apartment / Unit / Flat 2158 87360 0.198 -33.84723 151.21163 42 0.1 220.1 1.99 Lower North Shore 2061 20.3 9.7 15.8 0.0 8.1 0.0 6.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7107 1000000 2021-11-21 Eastern Creek 2 5 4 601 House 827 33852 20.431 -33.79734 150.84976 63 0.1 220.1 34.17 Western Suburbs 2766 21.6 22.7 0.0 0.0 5.7 3.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7110 1115000 2021-12-15 Sadleir 2 4 2 639 House 3135 21060 0.899 -33.91680 150.89054 25 0.1 220.1 30.09 Western Suburbs 2168 15.1 17.7 0.0 0.0 4.3 0.0 0.0 0.0 0.0 0.0 14.1 0.0 0.0 9.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

378 rows × 48 columns

taking data of those areas having highest level of English speakers.

In [60]:
largest_english=test.nlargest(n=10
                            , columns=['English'])
largest_english
Out[60]:
price date_sold suburb num_bath num_bed num_parking property_size type suburb_population suburb_median_income suburb_sqkm suburb_lat suburb_lng suburb_elevation cash_rate property_inflation_index km_from_cbd Region Postcode English Australian Chinese Greek Irish Indian Scottish Italian Nepalese Korean Lebanese Mongolian ChineseScottish Vietnamese Thai Filipino Turkish Iraqi Maltese KhmerCambodian Assyrian Bangladeshi Indonesian SriLankan Samoan German ScottishChinese Macedonian Afghan
196 1030000 2016-02-16 Avalon Beach 1 3 4 650 House 9905 43628 5.131 -33.62899 151.32686 46 2.00 150.9 28.35 Northern Beaches 2107 34.6 24.6 0.0 0.0 10.7 0.0 8.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.9 0.0 0.0 0.0
4537 1750000 2017-07-15 Palm Beach 2 4 1 1615 House 1593 54704 2.716 -33.59895 151.32589 0 1.50 174.2 31.42 Northern Beaches 2108 34.2 21.2 0.0 0.0 10.1 0.0 11.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.1 0.0 0.0 0.0
4425 3100000 2017-06-21 Lilli Pilli 4 6 4 1514 House 1390 51636 0.607 -34.06716 151.11552 42 1.50 174.2 24.05 Sutherland Shire 2229 33.0 25.4 0.0 0.0 11.4 0.0 6.4 3.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6890 1067500 2020-08-26 Kurnell 1 3 2 607 House 2267 39364 19.199 -34.02098 151.18685 1 0.25 167.6 17.42 Sutherland Shire 2231 32.8 30.6 0.0 0.0 8.6 0.0 6.5 2.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2184 1250000 2016-09-24 Fairlight 1 2 1 78 Apartment / Unit / Flat 5840 65000 1.168 -33.79412 151.27428 38 1.50 167.6 9.88 Northern Beaches 2094 32.7 20.3 0.0 0.0 11.5 0.0 8.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0
4009 1870000 2017-03-31 Newport 2 5 4 631 House 9301 46852 3.812 -33.65717 151.31571 12 1.50 176.6 25.06 Northern Beaches 2106 32.4 24.6 0.0 0.0 10.1 0.0 8.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.1 0.0 0.0 0.0
6525 1580000 2019-07-25 Davidson 3 4 2 1141 House 2849 45292 2.567 -33.73950 151.19204 137 1.02 159.0 14.03 Northern Suburbs 2085 32.1 24.1 2.8 0.0 9.1 0.0 7.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1662 950000 2016-07-22 Berowra 1 3 2 841 House 4721 44200 8.297 -33.62806 151.15004 194 1.75 159.3 26.88 Northern Suburbs 2081 31.9 28.5 0.0 0.0 9.6 0.0 7.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.2 0.0 0.0 0.0
5332 2600000 2017-12-21 Mona Vale 2 4 1 658 House 10670 45240 4.556 -33.67706 151.30276 15 1.50 171.9 22.57 Northern Beaches 2103 31.5 24.4 0.0 0.0 8.9 0.0 7.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.9 0.0 0.0 0.0
2203 1172500 2016-09-24 Narrabeen 1 2 1 109 Apartment / Unit / Flat 8207 48932 2.197 -33.71505 151.29588 0 1.50 167.6 18.45 Northern Beaches 2101 31.4 22.6 0.0 0.0 9.6 0.0 8.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.8 0.0 0.0 0.0

Bar graph showing highest to lowest ethnics with English at top and with the top most highest suburb in term of ethnic presence.

7.What is overall trend of highest and lowest percentage of ethnics of language speakers in 10 different suburbs?¶

In [61]:
fig = px.bar(largest_english,x='suburb',  y=["English", "Australian",'Chinese','Irish','Scottish','Italian','German'], text_auto='.2s',
            title="Default: various text sizes, positions and angles")
fig.show()

As it can be seen clearly there is a very slight change in speaker of English in 10 different suburbs and also we can say there is no change in the lowest language speakers which are German. There is fluctuations in both of them but can be seem no huge changes are there in any of these.

In [62]:
#Numerical Dataset
df_numerical = data_price.drop(['suburb','type','property_inflation_index','cash_rate','suburb_elevation'],axis=1)
df_numerical
Out[62]:
price date_sold num_bath num_bed num_parking property_size suburb_population suburb_median_income suburb_sqkm suburb_lat suburb_lng km_from_cbd
0 530000 13/1/16 4 4 2 1351 7093 29432 9.914 -33.47252 151.40208 47.05
1 525000 13/1/16 2 4 2 594 2538 24752 1.397 -33.21772 151.55237 78.54
2 480000 13/1/16 2 4 2 468 2028 31668 1.116 -33.32678 151.44557 63.59
3 452000 13/1/16 1 3 1 344 9835 32292 4.055 -34.05375 150.83957 40.12
4 365500 13/1/16 0 0 0 1850 2200 45084 1.497 -33.45608 151.43598 49.98
... ... ... ... ... ... ... ... ... ... ... ... ...
11155 1900000 31/12/21 3 4 2 540 27971 46228 18.645 -33.69583 150.95622 30.08
11156 1300000 31/12/21 3 7 2 1208 19326 33540 9.629 -33.77743 150.94272 26.58
11157 1025000 31/12/21 2 2 1 129 17252 35412 2.940 -33.86794 151.20998 0.31
11158 1087500 1/1/22 2 4 2 384 15313 36244 9.215 -33.94155 150.87334 32.26
11159 1000000 1/1/22 2 3 2 667 3951 37180 87.154 -33.31517 151.32611 61.95

11160 rows × 12 columns

What does Sydney houses pricing dataset show about the prices of the houses over different period?¶

In [63]:
plt.scatter(data_merge.date_sold, data_merge.price, marker='o', c='b',edgecolor='r', alpha=0.5)
plt.title("Sydney housing price over years")
plt.grid()
plt.show()

In this dataset there is shown from the year 2016 to 2022. The prices on 2016 is very low and the peak value is almost 0.25 and in 2022 the values are increasing and the highest values are 1.65 and 1.75 which is a huge change and with this graph it is clearly mentioned that the prices are going very high with each year passing.

How do the house prices in Sydney vary by suburb according to the data in the "sydney-house-prices" dataset¶

In [64]:
suburbs = data_merge.groupby(['suburb'])['price'].aggregate('median')
print(suburbs)
suburb
Abbotsbury          1425000.0
Abbotsford          2107500.0
Alexandria          1350000.0
Allambie Heights    1875000.0
Allawah             1688000.0
                      ...    
Woronora            1203000.0
Yagoona              870000.0
Yarrawarrah         1492500.0
Yowie Bay           1980000.0
Zetland             1445000.0
Name: price, Length: 378, dtype: float64
In [65]:
suburbs = pd.DataFrame(suburbs).reset_index()
suburbs.head()
Out[65]:
suburb price
0 Abbotsbury 1425000.0
1 Abbotsford 2107500.0
2 Alexandria 1350000.0
3 Allambie Heights 1875000.0
4 Allawah 1688000.0
In [66]:
results = []

for i in suburbs['price']:
    if i <= suburbs['price'].quantile(0.25):
        results.append('low')
    elif i > suburbs['price'].quantile(0.25) and i <= suburbs['price'].quantile(0.50):
        results.append('medium_low')
    elif i > suburbs['price'].quantile(0.50) and i <= suburbs['price'].quantile(0.75):
        results.append('medium_high')
    else:
        results.append('high')
        
    
# add to new dataframe
suburbs['suburb_group'] = results
In [67]:
suburbs.head()
Out[67]:
suburb price suburb_group
0 Abbotsbury 1425000.0 medium_low
1 Abbotsford 2107500.0 high
2 Alexandria 1350000.0 medium_low
3 Allambie Heights 1875000.0 medium_high
4 Allawah 1688000.0 medium_high

Based on the dataset, it appears that the house prices in Sydney vary greatly by suburb. Some suburbs have a high median price (e.g. Abbotsford), while others have a medium-high median price (e.g. Allambie Heights, Allawah). Additionally, some suburbs have a medium-low median price (e.g. Abbotsbury, Alexandria), while others have a low median price. This suggests that the house prices in Sydney can vary significantly depending on the suburb.

In [68]:
# converting into a dictionary
suburb = suburbs['suburb']
suburb_group = suburbs['suburb_group']
suburb_dict = dict(zip(suburb, suburb_group))
In [69]:
# converting to an array to fill with grouped values
convert = data_merge.suburb
convert = list(convert)


# conversion into an array
new_list = []

for conversion in convert:
    for check in suburb_dict:
        if conversion == check:
            new_list.append(suburb_dict[check])
print(f"Length of dataframe : {len(data_merge)}")
print(f"Length of array : {len(new_list)}")
Length of dataframe : 7111
Length of array : 7111
In [70]:
data_merge['suburb_group'] = new_list
data_merge.head()
Out[70]:
price date_sold suburb num_bath num_bed num_parking property_size type suburb_population suburb_median_income suburb_sqkm suburb_lat suburb_lng suburb_elevation cash_rate property_inflation_index km_from_cbd Region Postcode English Australian Chinese Greek Irish Indian Scottish Italian Nepalese Korean Lebanese Mongolian ChineseScottish Vietnamese Thai Filipino Turkish Iraqi Maltese KhmerCambodian Assyrian Bangladeshi Indonesian SriLankan Samoan German ScottishChinese Macedonian Afghan suburb_group
0 452000 2016-01-13 Leumeah 1 3 1 344 House 9835 32292 4.055 -34.05375 150.83957 81 2.0 150.9 40.12 South West 2560 22.7 23.0 0.0 0.0 6.8 3.4 5.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 low
1 495000 2016-01-15 Leumeah 1 3 2 582 House 9835 32292 4.055 -34.05375 150.83957 81 2.0 150.9 40.12 South West 2560 22.7 23.0 0.0 0.0 6.8 3.4 5.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 low
2 473000 2016-01-19 Leumeah 1 3 3 581 House 9835 32292 4.055 -34.05375 150.83957 81 2.0 150.9 40.12 South West 2560 22.7 23.0 0.0 0.0 6.8 3.4 5.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 low
3 520000 2016-01-20 Leumeah 1 3 1 651 House 9835 32292 4.055 -34.05375 150.83957 81 2.0 150.9 40.12 South West 2560 22.7 23.0 0.0 0.0 6.8 3.4 5.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 low
4 585000 2017-09-22 Leumeah 1 3 1 436 House 9835 32292 4.055 -34.05375 150.83957 81 1.5 174.0 40.12 South West 2560 22.7 23.0 0.0 0.0 6.8 3.4 5.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 low
In [71]:
data_merge['type'].value_counts()
Out[71]:
House                            5831
Apartment / Unit / Flat           609
Townhouse                         178
Semi-Detached                     131
Vacant land                       102
Villa                              87
Terrace                            60
Duplex                             53
Block of Units                     30
New House & Land                   10
Development Site                    6
New Apartments / Off the Plan       6
Studio                              5
New land                            2
Acreage / Semi-Rural                1
Name: type, dtype: int64
In [72]:
data_merge['type'].unique()
Out[72]:
array(['House', 'Townhouse', 'Vacant land', 'Semi-Detached', 'Duplex',
       'Apartment / Unit / Flat', 'Villa', 'Terrace', 'New House & Land',
       'Block of Units', 'Development Site', 'Studio',
       'New Apartments / Off the Plan', 'New land',
       'Acreage / Semi-Rural'], dtype=object)
In [73]:
sns.set()
# Create a Figure and an Axes with plt.subplots
fig, ax = plt.subplots(1,1,figsize=(20,15),sharex=False)

#county_order = data_merge.sort_values('price')['Region']

#sns.barplot(x='State', y='Headcount Ratio (%)', data=gdf_regions, ax=ax1, color='c', )

sns.barplot(ax=ax,data =data_merge, x="type", y="price" )
ax.set_ylabel("price")
ax.set_xlabel(" Type of properties ")
ax.title.set_text("Sales Price and  type")
plt.xticks(rotation='vertical')
plt.show()
In [74]:
# converting to array 
types = data_merge['type'].to_list()

# groupings
house = ['House', 'New House & Land', 'Villa','Duplex','Terrace']
apartments = ['Apartment / Unit / Flat', 'Townhouse', 'Studio']

# new array
new_types = []

for conversion in types:
    if conversion in house:
        new_types.append('house')
    elif conversion in apartments:
        new_types.append('apartments')
    else:
        new_types.append('other')

# inputting into dataframe
data_merge['type'] = new_types
In [75]:
fig, ax = plt.subplots(1,1,figsize=(16, 6), dpi=100)
sns.set_theme(style='whitegrid')
sns.despine()
sns.kdeplot(data_merge[data_merge['type']=='house']['price'], color='#1EAE98', ax=ax, label='House')
sns.kdeplot(data_merge[data_merge['type']=='apartments']['price'], color='#5800FF', ax=ax, label='Apartments')
sns.kdeplot(data_merge[data_merge['type']=='other']['price'], color='#B8B5FF', ax=ax, label='Other')
plt.legend()
plt.show()

What is the relationship between the sales price of properties in Sydney and factors such as the type of region, number of beds, property size, distance from the CBD, median income of the suburb, and population of the suburb?¶

In [76]:
import plotly.express as px
sqft_trend = pd.concat([data_merge['price'], data_merge['property_size']], axis = 1)
fig = px.scatter(data_merge, x = 'property_size', y = 'price', title = 'Price vs Property Size', labels = 'dict(price = "Price ", sqft_living = "Sqft ")')
fig.update_layout(yaxis_range = [0 , 16500000], xaxis_range = [0 , 7000], width = 800, height = 600)
fig.show()
In [77]:
import plotly.express as px
sqft_trend = pd.concat([data_merge['price'], data_merge['km_from_cbd']], axis = 1)
fig = px.scatter(data_merge, x = 'km_from_cbd', y = 'price', title = 'Price vs KM from CBD', labels = dict(price = "Price ", sqft_living = "Sqft "))
fig.update_layout(yaxis_range = [0 , 16500000], xaxis_range = [0 , 60], width = 800, height = 600)
fig.show()
In [78]:
sqft_trend = pd.concat([data_merge['price'], data_merge['date_sold']], axis = 1)
fig = px.scatter(data_merge, x = 'date_sold', y = 'price', title = 'Price vs Date Sold', labels = dict(price = "Price ", sqft_living = "Sqft "))
fig.update_layout(yaxis_range = [0 , 16500000], xaxis_range = ["2016", "2022"], width = 800, height = 600)
fig.show()
In [79]:
sns.set()
# Create a Figure and an Axes with plt.subplots
fig, ax = plt.subplots(1,1,figsize=(20,15),sharex=False)

#county_order = data_merge.sort_values('price')['Region']

#sns.barplot(x='State', y='Headcount Ratio (%)', data=gdf_regions, ax=ax1, color='c', )

sns.barplot(ax=ax,data =data_merge, x="Region", y="price" )
ax.set_ylabel("Region")
ax.set_xlabel(" Type of region")
ax.title.set_text("Sales Price and  region")
plt.xticks(rotation='vertical')
plt.show()
In [80]:
sns.set()
# Create a Figure and an Axes with plt.subplots
fig, ax = plt.subplots(3,2,figsize=(18,15))

sns.barplot(ax=ax[0, 0],data =largest_bedroom, x="num_bed", y="price")
ax[0,0].set_ylabel("Sales Price")
ax[0,0].set_xlabel(" Type of number of bed")
ax[0,0].title.set_text("Sales Price and  number of bed")

sns.barplot(ax=ax[0, 1],data =data_merge, x="num_bath", y="price")
ax[0,1].set_ylabel("Sales Price")
ax[0,1].set_xlabel("Number of bath")
ax[0,1].title.set_text("Sales Price and  number of bath")

sns.barplot(ax=ax[1, 0],data =data_merge, x="num_parking", y="price")
ax[1,0].set_ylabel("Sales Price")
ax[1,0].set_xlabel("Number of parking")
ax[1,0].title.set_text("Sales Price and  number of parking")


sns.barplot(ax=ax[1, 1],data =data_merge, x="type", y="price")
ax[1,0].set_ylabel("Sales Price")
ax[1,0].set_xlabel("type")
ax[1,0].title.set_text("Sales Price and type")


sns.barplot(ax=ax[2, 0],data =data_merge, x="suburb_median_income", y="price")
ax[1,0].set_ylabel("Sales Price")
ax[1,0].set_xlabel("suburb_median_income")
ax[1,0].title.set_text("Sales Price and suburb_median_income")

sns.barplot(ax=ax[2, 1],data =data_merge, x="suburb_population", y="price")
ax[1,0].set_ylabel("Sales Price")
ax[1,0].set_xlabel("suburb_population")
ax[1,0].title.set_text("Sales Price and suburb population")

plt.show()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/var/folders/80/3rj880k53pj5mtpfdy8rnhcm0000gn/T/ipykernel_8618/1392010366.py in <module>
      3 fig, ax = plt.subplots(3,2,figsize=(18,15))
      4 
----> 5 sns.barplot(ax=ax[0, 0],data =largest_bedroom, x="num_bed", y="price")
      6 ax[0,0].set_ylabel("Sales Price")
      7 ax[0,0].set_xlabel(" Type of number of bed")

NameError: name 'largest_bedroom' is not defined

We can observe the relationship between the sales price of properties in Sydney and various factors such as the type of region, number of beds, property size, distance from the CBD, median income of the suburb, and population of the suburb. The graphs display the average sales price on the y-axis and the different factors on the x-axis. From the graphs, we can see if there is a positive or negative correlation between the sales price and each factor, such as if higher median income in the suburb leads to higher sales prices or if properties closer to the CBD have higher sales prices compared to those further away.

Latitude and Longitud relationship with Map¶

In [81]:
from folium import Map
from folium.plugins import HeatMap
heat_data = [[row['suburb_lat'],row['suburb_lng']] for _, row in data_merge.iterrows()]
heat_map = Map(data_merge[['suburb_lat', 'suburb_lng']].mean(axis=0), zoom_start=10) 
HeatMap(heat_data, radius=10).add_to(heat_map)
heat_map
Out[81]:
Make this Notebook Trusted to load map: File -> Trust Notebook

What is the relationship between the population of a suburb and its review score?¶

What can be the main factor of properties expensiveness?¶

In [82]:
import branca
import folium
inferno_colors = [
    (0, 0, 4),
    (40, 11, 84),
    (101, 21, 110),
    (159, 42, 99),
    (212, 72, 66),
    (245, 125, 21),
    (250, 193, 39),
    (252, 255, 164)
]

map = folium.Map(data_merge[['suburb_lat', 'suburb_lng']].mean(axis=0), zoom_start = 11)
lat = list(data_merge.suburb_lat)
lon = list(data_merge.suburb_lng)
populations = list(data_merge.suburb_population)
targets = list(data_merge.price)
 
# define colormap using inferno colors and normalizing them according MedHouseVal
cmap = branca.colormap.LinearColormap(
    inferno_colors, vmin=min(targets), vmax=max(targets)
)

for loc, population, target in zip(zip(lat, lon), populations, targets):
    folium.Circle(
        location=loc,
        radius=population/20,
        fill=True,
            color=cmap(target),
        fill_opacity=0.5,
        weight=0
    ).add_to(map)

map.add_child(cmap)
display(map)
Make this Notebook Trusted to load map: File -> Trust Notebook

It is possible to observe the relationship between the population of a suburb and its review score by the size and colour of the circles. If larger circles have a darker colour, it will indicate that larger population suburbs have higher review scores, and vice versa. This visualization can help in understanding the relationship between the population of a suburb and its review score, allowing for insights into consumer opinions in the area. The graph shows that suburbs located further away from the CBD have a higher score.

Those properties whether they are apartments, houses or any other type of properties which are near to CBD are much more expensive than those who has some distance from main CBD. And those properties which are nearer to beaches whether the beaches are near to cities or far from cities tend to be much more expensive as seen in the graph. But those properties which are in CBD and near to sea are much more expensive than all others as they pose both the properties to be in CBD and at the side of the sea.

Prepraring data for prediction¶

In [102]:
prediction_df= data_merge.copy
prediction_df=data_merge.drop(['date_sold','suburb_lat','suburb_lng','suburb_group','Postcode','Region','type','suburb'], axis='columns')
prediction_df
Out[102]:
price num_bath num_bed num_parking property_size suburb_population suburb_median_income suburb_sqkm suburb_elevation cash_rate property_inflation_index km_from_cbd English Australian Chinese Greek Irish Indian Scottish Italian Nepalese Korean Lebanese Mongolian ChineseScottish Vietnamese Thai Filipino Turkish Iraqi Maltese KhmerCambodian Assyrian Bangladeshi Indonesian SriLankan Samoan German ScottishChinese Macedonian Afghan
0 452000 1 3 1 344 9835 32292 4.055 81 2.0 150.9 40.12 22.7 23.0 0.0 0.0 6.8 3.4 5.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 495000 1 3 2 582 9835 32292 4.055 81 2.0 150.9 40.12 22.7 23.0 0.0 0.0 6.8 3.4 5.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 473000 1 3 3 581 9835 32292 4.055 81 2.0 150.9 40.12 22.7 23.0 0.0 0.0 6.8 3.4 5.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 520000 1 3 1 651 9835 32292 4.055 81 2.0 150.9 40.12 22.7 23.0 0.0 0.0 6.8 3.4 5.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 585000 1 3 1 436 9835 32292 4.055 81 1.5 174.0 40.12 22.7 23.0 0.0 0.0 6.8 3.4 5.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7106 1900000 2 2 1 124 2158 87360 0.198 42 0.1 220.1 1.99 20.3 9.7 15.8 0.0 8.1 0.0 6.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7107 1000000 2 5 4 601 827 33852 20.431 63 0.1 220.1 34.17 21.6 22.7 0.0 0.0 5.7 3.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7108 1065000 3 3 2 949 827 33852 20.431 63 0.1 220.1 34.17 21.6 22.7 0.0 0.0 5.7 3.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7109 1000000 2 4 2 910 827 33852 20.431 63 0.1 220.1 34.17 21.6 22.7 0.0 0.0 5.7 3.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7110 1115000 2 4 2 639 3135 21060 0.899 25 0.1 220.1 30.09 15.1 17.7 0.0 0.0 4.3 0.0 0.0 0.0 0.0 0.0 14.1 0.0 0.0 9.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

7111 rows × 41 columns

In [106]:
X=prediction_df.drop(['price'],axis=1)
y=prediction_df['price']
In [85]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=101)

The training dataset and test dataset must be similar, usually have the same predictors or variables. They differ on the observations and specific values in the variables. If you fit the model on the training dataset, then you implicitly minimize error or find correct responses. The fitted model provides a good prediction on the training dataset. Then you test the model on the test dataset. If the model predicts good also on the test dataset, you have more confidence. You have more confidence since the test dataset is similar to the training dataset, but not the same nor seen by the model. It means the model transfers prediction or learning in real sense.

So,by splitting dataset into training and testing subset, we can efficiently measure our trained model since it never sees testing data before.Thus it's possible to prevent overfitting.

I am just splitting dataset into 30% of test data and remaining 80% will used for training the model.

In [86]:
scaler=StandardScaler()
Scaled_X_train=scaler.fit_transform(X_train)
Scaled_X_test=scaler.transform(X_test)
In [87]:
#ALL_Types_of_Regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error,mean_squared_error
from sklearn.metrics import confusion_matrix,classification_report,plot_confusion_matrix
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor,AdaBoostRegressor
from sklearn.metrics import mean_absolute_error,mean_squared_error
from sklearn.metrics import confusion_matrix,classification_report,plot_confusion_matrix
import numpy as np
In [124]:
def run_model(model,X_train,X_test,y_train,y_test):
    Scaled_X_train=scaler.fit_transform(X_train).round()
    Scaled_X_test=scaler.transform(X_test).round()
    model.fit(Scaled_X_train,y_train)
    preds=model.predict(Scaled_X_test)
    rmse=np.sqrt(mean_squared_error(y_test,preds))
    mae=mean_absolute_error(y_test,preds)
    print(f'MAE: {mae}')
    print(f'RMSE: {rmse}')
In [125]:
#Linear_Regression
lr_model=LinearRegression()
run_model(lr_model,X_train,X_test,y_train,y_test)
preds=lr_model.predict(Scaled_X_test)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/80/3rj880k53pj5mtpfdy8rnhcm0000gn/T/ipykernel_8618/3089355276.py in <module>
      1 #Linear_Regression
      2 lr_model=LinearRegression()
----> 3 run_model(lr_model,X_train,X_test,y_train,y_test)
      4 preds=lr_model.predict(Scaled_X_test)

/var/folders/80/3rj880k53pj5mtpfdy8rnhcm0000gn/T/ipykernel_8618/3245164852.py in run_model(model, X_train, X_test, y_train, y_test)
      4     model.fit(Scaled_X_train,y_train)
      5     preds=model.predict(Scaled_X_test)
----> 6     rmse=np.sqrt(mean_squared_error(y_test,preds))
      7     mae=mean_absolute_error(y_test,preds)
      8     print(f'MAE: {mae}')

~/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_regression.py in mean_squared_error(y_true, y_pred, sample_weight, multioutput, squared)
    436     0.825...
    437     """
--> 438     y_type, y_true, y_pred, multioutput = _check_reg_targets(
    439         y_true, y_pred, multioutput
    440     )

~/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_regression.py in _check_reg_targets(y_true, y_pred, multioutput, dtype)
     92         the dtype argument passed to check_array.
     93     """
---> 94     check_consistent_length(y_true, y_pred)
     95     y_true = check_array(y_true, ensure_2d=False, dtype=dtype)
     96     y_pred = check_array(y_pred, ensure_2d=False, dtype=dtype)

~/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    330     uniques = np.unique(lengths)
    331     if len(uniques) > 1:
--> 332         raise ValueError(
    333             "Found input variables with inconsistent numbers of samples: %r"
    334             % [int(l) for l in lengths]

ValueError: Found input variables with inconsistent numbers of samples: [2135, 2134]
In [90]:
residuals=y_test-preds
residuals
sns.scatterplot(x=y_test,y=residuals)
sns.displot(residuals,kde=True)
Out[90]:
<seaborn.axisgrid.FacetGrid at 0x7fe50807aa60>
In [ ]:
 
In [91]:
import scipy as sp
g,ax=plt.subplots(figsize=(6,8),dpi=50)
_=sp.stats.probplot(residuals,plot=ax)
final_model=LinearRegression()
final_model.fit(X_train,y_train)
Out[91]:
LinearRegression()
In [92]:
X_train.shape,y_train.shape,X_test.shape,y_test.shape
Out[92]:
((4977, 40), (4977,), (2134, 40), (2134,))

What is the predicted prices and actual prices of the houses and elaborate the relation and trends?¶

In [93]:
c = [i for i in range(1,2135,1)] # generating index 
fig = plt.figure(figsize=(6,4))
plt.plot(c,y_test, color="blue", linewidth=2.5, linestyle="-") #Plotting Actual
plt.plot(c,preds, color="red",  linewidth=2.5, linestyle="-") #Plotting predicted
fig.suptitle('Actual and Predicted', fontsize=15)              # Plot heading 
plt.xlabel('Index', fontsize=18)                               # X-label
plt.ylabel('Housing Price', fontsize=16)    
Out[93]:
Text(0, 0.5, 'Housing Price')

There is a huge difference between the actual and predicted prices of the houses. The Red indicates the predicted prices of the houses while the blue indicates the real prices. Although we can see there are some peaks also which indicates a very high real price. While on predicated there is a slight peak which shows a value of 0.6 only but on another hand the actual prices have more than 1.4 also.

In [94]:
print(final_model.coef_)
[ 1.90952483e+05  7.73073088e+04  3.50767572e+04  4.17962300e+02
  2.14514438e+00  2.12610029e+01  1.08144499e+03 -1.33076066e+03
 -6.45429990e+04  8.49620180e+03 -2.18478904e+04  5.10626894e+04
 -4.84360120e+04  1.71636744e+04  2.87914722e+04  2.68882703e+04
  1.27869053e+04  3.41602680e+04  2.37376813e+04 -3.61130758e+03
  2.00843074e+04  1.24194221e+04 -1.98972230e+05  1.44354999e-08
  2.15756376e+02 -2.05407795e+04  1.49623387e+04 -2.08061319e+04
  5.60173405e+04 -5.84639580e+03 -1.46246748e-09  1.20300485e+04
  2.98609231e+04  5.19862468e+04 -8.00355338e-11  8.33787920e+04
  3.61649711e+04  0.00000000e+00  2.37581836e+04  1.64297683e+05]
In [95]:
print(final_model.intercept_)
-1881184.3953311269

How to compare the performance of different machine learning models?¶

In [96]:
from time import time
from sklearn.metrics import explained_variance_score
from sklearn.ensemble import RandomForestRegressor

rand_regr = RandomForestRegressor(n_estimators=400,random_state=0)
start = time()
rand_regr.fit(X_train, y_train)
end=time()
train_time_rand=end-start
random=rand_regr.score(X_test,y_test)
predictions = rand_regr.predict(X_test)
exp_rand = explained_variance_score(predictions,y_test)
In [97]:
from sklearn.ensemble import GradientBoostingRegressor
start = time()
est=GradientBoostingRegressor(n_estimators=400, max_depth=5,min_samples_split=2,learning_rate=0.1).fit(X_train, y_train)
end=time()
train_time_g=end-start
gradient=est.score(X_test,y_test)

pred = est.predict(X_test)
exp_est = explained_variance_score(pred,y_test)
In [98]:
from sklearn.ensemble import AdaBoostRegressor
start = time()
ada=AdaBoostRegressor(n_estimators=50, learning_rate=0.2,loss='exponential').fit(X_train, y_train)
end=time()
train_time_ada=end-start
pred=ada.predict(X_test)
adab=ada.score(X_test,y_test)
predict = ada.predict(X_test)
exp_ada = explained_variance_score(predict,y_test)
In [99]:
from sklearn.tree  import DecisionTreeRegressor
decision=DecisionTreeRegressor()
start = time()
decision.fit(X_train, y_train)
end=time()
train_time_dec=end-start
decc=decision.score(X_test,y_test)
decpredict = decision.predict(X_test)
exp_dec = explained_variance_score(decpredict,y_test)
In [100]:
# Comparing Models on the basis of Model's Accuracy Score and Explained Variance Score of different models
models_cross = pd.DataFrame({
    'Model': ['Gradient Boosting','AdaBoost','Random Forest','Decision Tree'],
    'Score': [gradient,adab,random,decc],
     'Variance Score': [exp_est,exp_ada,exp_rand,exp_dec]})
    
models_cross.sort_values(by='Score', ascending=False)
Out[100]:
Model Score Variance Score
0 Gradient Boosting 0.690107 0.588892
2 Random Forest 0.688659 0.522213
1 AdaBoost 0.317593 -0.737741
3 Decision Tree 0.235349 0.349351

ANALYZING TRAINING TIME EACH MODEL¶

In [101]:
import matplotlib.pyplot as plt
import numpy as np
model = ['Adab54soost', 'GBOOST', 'Random forest', 'Decision Tree']
Train_Time = [
    train_time_ada,
    train_time_g,
    train_time_rand,
    train_time_dec
    
]
index = np.arange(len(model))
plt.bar(index, Train_Time)
plt.xlabel('Machine Learning Models', fontsize=15)
plt.ylabel('Training Time', fontsize=15)
plt.xticks(index, model, fontsize=10, )
plt.title('Comparison of Training Time of all ML models')
plt.show()

From the above figure it is inferred that decision tree has taken negligible amount of time to train where as Randome forest has taken maximum time and it is yet obvious because as we increase the number of tree in this case training time will increase so we should look out for optimal model which has greater accuracy and less training time in comparison to other So, in this case GBoost is the best choice as its accuracy is highest and it is taking less time to train with accuracy.

Conculsion¶

The two datasets provide valuable information about the real estate market in Sydney, Australia, and the suburbs in the area. The "Sydney House Prices" dataset includes information about the sale prices of properties in various suburbs, while the "Sydney Suburbs Reviews" dataset offers insight into the reputation and desirability of each suburb based on customer reviews. By analysing these datasets together, it is possible to draw some conclusions about the relationship between the cost of properties and the popularity or quality of a suburb as a place to live.

For example, the data might show that suburbs with higher-priced properties tend to receive higher ratings and positive reviews from customers. This could indicate that these areas are considered more desirable and better places to live. Conversely, suburbs with lower-priced properties may receive lower ratings or mixed reviews, which could indicate that they are not as highly regarded by customers.

This information can be useful for individuals who are looking to purchase a property in Sydney or considering moving to a new suburb. By examining the data on both the cost of properties and the reputation of the suburbs, potential buyers and residents can make more informed decisions based on their budget and lifestyle preferences. For example, those who prioritize affordability may be more likely to consider suburbs with lower-priced properties, while those who place a higher value on quality of life may be more interested in suburbs with higher ratings and positive reviews.

In conclusion, the two datasets offer valuable information about the Sydney real estate market and suburbs, and by analysing them together, it is possible to gain a better understanding of the relationship between property prices and suburb quality and make more informed decisions about where to live in Sydney.

References¶

Alexlau203. (n.d.). Sydney House Prices [Data set]. Kaggle. https://www.kaggle.com/datasets/alexlau203/sydney-house-prices.

Karltse. (n.d.). Sydney Suburbs Reviews [Data set]. Kaggle. https://www.kaggle.com/datasets/karltse/sydney-suburbs-reviews.

McKinney, W. (2012). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media, Inc. https://proquest-safaribooksonline-com.ezproxy.lib.ryerson.ca/book/programming/python/9781449319793.

VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media, Inc. https://proquest-safaribooksonline-com.ezproxy.lib.ryerson.ca/book/programming/python/9781491912126.

Matplotlib Development Team. (2021). Matplotlib. https://matplotlib.org.

Seaborn Development Team. (2021). Seaborn: Statistical data visualization. https://seaborn.pydata.org.

Scipy Development Team. (2021). SciPy: Open source scientific tools for Python. https://scipy.org.